The need for domain ontologies in mission-critical applications such as risk management and 
hazard identification is becoming increasingly pressing. Most research on ontology learning 
conducted in academia remains unrealistic for real-world applications. One of the main 
problems is the dependence on non-incremental, rare knowledge and textual resources, and 
manually-crafted patterns and rules. This paper reports work in progress aiming to address such 
undesirable dependencies during ontology construction. Initial experiments using a working 
prototype of the system revealed promising potential for automatically constructing high-quality 
domain ontologies from real-world texts. 

1. Introduction 

Hazard identification is a crucial aspect of risk management. The identification of hazards is 
the prerequisite step to the analysis and treatment of risks. As such, clear definitions on the type 
of risks and the processes involved for hazard avoidance and treatment are necessary. Unam- 
biguous definition enables effective communication, which is crucial in passing on experiences 
and expertise to trainees and students dealing with dangerous chemicals and products. However, 
very often such knowledge is embedded in the domain experts' minds, or scattered across various 
formats, e.g. operation notes, online resources, scientific publications or technical reports. An 
integrated knowledge structure known as an ontology is therefore becoming necessary for 
describing the concepts and processes to ease information sharing and reuse. Some 
possible applications of domain ontologies include conceptual document retrieval and decision 
support systems. The importance of ontologies to knowledge-based applications has prompted 
an increase in efforts to construct and maintain such knowledge structures. Generally, there are 
two ways of constructing ontologies, namely, manual crafting and automatic discovery. 

Manual construction and maintenance of ontologies is often criticised for being labour intensive, 
biased and static. Such a manual process typically requires multiple domain experts to identify 
the key concepts and processes, and then collaborate with knowledge engineers for effective 
digital representation. The neutrality and representativeness of manually-crafted ontologies are 
also disputable when the domain experts are unable to reach consensus during the knowledge 
engineering process. New changes to the domain are often ignored and cannot be incorporated 
into the ontology in a timely manner. To address the problems related to manual ontology 
construction and maintenance, several systems have been developed in the past for automat- 
ically constructing domain ontologies. However, these existing systems are mainly based on 
manually-crafted patterns and rules, and non-incremental, static textual resources. Such de- 
pendencies impose great restrictions on the systems' applicability to a wider range of domains. 
Moreover, automatic ontology construction is a relatively new research area. Many techniques 
used in existing systems were borrowed from related fields in Computer Science such as In- 
formation Extraction, Information Retrieval and Text Mining. Direct applications of such tech- 
niques are inadequate in addressing the various peculiarities of converting real-world natural 
language texts to domain ontologies. Ideally, systems should focus on automatically generating 
high-quality lightweight domain ontologies with facilities for manual refinements by domain 
experts. 

In this paper, we introduce a system for automatic ontology construction to promote the estab- 
lishment of a standard vocabulary in the field of risk management. To address the above issues, 
our ontology construction system utilises dedicated techniques for constructing lightweight on- 
tologies, relying only on dynamic textual resources on the Web. The system extracts key con- 
cepts and relations automatically from electronic domain texts to construct domain ontologies. 
The domain texts may be technical reports, operation notes, electronic books and web pages 
covering risk management and hazard identification in chemical engineering processes. The 
lightweight ontology can then be maintained or edited by domain experts using various ontology 
editing software. The paper is organised as follows. We provide a brief review of related 
work in Section 2. We proceed to elaborate on the system architecture and its strengths in 
Section 3. In Section 4, we demonstrate the process of ontology construction by conducting 
an experiment with a working prototype of our system using domain texts constructed from 
ScienceDirect and a textbook. In Section 5, we conclude with an outlook to future work. 

2. Related Work 

In this section, we briefly review some manual efforts and automated systems for constructing 
ontologies. Prior to the rise in popularity of automatic ontology construction, domain experts 
engaged in collaborative efforts to create ontologies. One such pioneering project is the 
Gene Ontology (GO) [1]. Even to date, the impractical constraints imposed by existing 
automatic ontology construction systems and their far from satisfactory results have prompted 
experts to continue working manually to construct high-quality domain ontologies. The Plant 
Ontology Consortium (POC) [llM has produced one of the more recent handcrafted ontologies, which 
integrates a wide range of vocabularies used to describe the anatomy, morphology and growth 
stages of several plants. The European Bioinformatics Institute (EBI) initiated a collective 
effort to construct the Chemical Entities of Biological Interest (ChEBI) ontology [9]. ChEBI is 
an ontology which focuses on molecular entities used to intervene in the processes of living 
organisms. Many of these manual efforts are made possible by widely available ontology 
development tools such as OntoLingua [10] and Protege [21]. In the related domain of risk 
management, a recent work by [12] produced a domain ontology through the manual 
identification and organisation of key concepts and processes for the domain of hazard 
identification using Protege. 

Besides manual efforts, several ontology construction systems which aim at generating domain 
ontologies have also been developed in recent years. For example, OntoLearn [26] employs 
standard natural language processing (NLP) tools and corpus analysis to extract and 
recognise domain terms. Lexico-syntactic patterns [14] and WordNet [20] are utilised to extract 
semantic relations between the terms. Similarly, the Text-to-Onto system [13] makes use of 
non-incremental resources such as WordNet, and manually-crafted lexico-syntactic patterns to 
construct ontologies. In order to identify more complex relations, Text-to-Onto employs 
association rule learning. More recent work from [17] extracts terms and semantic relations through 
dependency structure analysis. The terms are mapped onto WordNet to obtain bags of senses. 
These senses are then clustered using cosine similarity. Semantic relations that consist of 
similar terms can be generalised using association rule mining algorithms for deducing statistically 
significant patterns. [24] conducted a study on clustering and the associated tasks of feature 
extraction and selection, and similarity measurement for constructing ontologies. Contexts, 
appearing as sentences in which the terms occur, are used as features in their study. [22] utilise 
dependency structure analysis to extract terms and relationships with the help of a controlled 
vocabulary called the Medical Subject Headings (MeSH) and domain knowledge in the form of 
the Unified Medical Language System (UMLS). 

The reliance on non-incremental lexical resources (e.g. MeSH and WordNet), and the use of 
non-dedicated techniques make such ontology construction systems inapplicable to real-world 
applications and not portable to other domains. 

3. System Architecture 

In this section, we present the architecture of our ontology construction system, which is 
designed to overcome the use of non-dedicated techniques, and the reliance on non-incremental 
resources and manually-crafted patterns and rules. The system comprises four main phases, 
shown in Figure 1 as shaded rectangles. These phases are text cleaning, text processing, 
term recognition and relation discovery. Text cleaning removes noise such as spelling errors, 
abbreviations and improper casing from texts. Text processing then extracts coherent 
three-part structures from the texts using linguistic information. Term recognition uses the extracted 
structures to produce a list of term candidates which are then shortlisted based on their relevance 
to the domain of interest. During the last phase of relation discovery, the semantic relations 
between terms are discovered to construct a domain ontology. The flow of intermediate output 
produced after each phase of processing is shown in Figure 2. As we will show in Section 4, 
errors introduced at each phase affect the performance of subsequent phases. The 
specific functionalities required in each phase are shown using rounded rectangles in Figure 1. 
Several techniques developed as part of this research for ontology construction are shown using 
white rounded rectangles while existing techniques and resources required by the system are 
depicted as shaded rounded rectangles. Unlike conventional ontology construction systems, our 
techniques were specifically developed to handle the peculiarities of the input at each phase. We 
will elaborate on the innovative aspects of our techniques as we progress. Besides 
the techniques, the preparation of text corpora is equally important. The system requires two 
sets of text corpora, namely, a contrastive corpus and a domain corpus, which are depicted as 
cylinders in Figure 1. The contrastive corpus is populated with general and non-domain specific 
electronic collections of texts, and web pages obtained from general news sites through web 
crawling. The domain corpus is built through guided web crawling which harvests scientific 
publications on the Web using key phrases provided by domain experts. Electronic versions of 
domain documents such as textbooks can also be added to the domain corpus. 

Figure 1. Ontology construction system architecture. The main phases, namely, text cleaning, 
text processing, term recognition and relation discovery in the system are represented using 
dark rectangles. The grey rounded rectangles represent existing techniques and functionalities 
required by the system while the white rounded rectangles depict novel techniques developed 
in this research. 

Ontology construction begins with an optional text cleaning phase using our 
technique known as Integrated Scoring for Spelling error correction, Abbreviation Expansion and 
Case restoration (ISSAC) [29,33]. ISSAC builds upon the widely used spell checker Aspell [2] to 
handle spelling errors, abbreviations and improper casing simultaneously. The 
technique has demonstrated high accuracy [29,33] in correcting these three types of noise. 
ISSAC combines weights based on various sources such as online abbreviation dictionaries, a string 
correction algorithm [27], contextual information, and co-occurrence analysis to determine the 
best replacement for each error word. Texts from edited or reputable sources such as academic 
journals can bypass this text cleaning phase. Many linguistic analysis tools and the systems 
that rely on them assume that the input texts are free from spelling errors, abbreviations and 
improper casing. However, the presence of such noise is inevitable in real-world texts, especially 
those from online sources. The inclusion of a text cleaning phase in our ontology construction 
system ensures robustness and better performance when dealing with online texts. 

Next, domain texts, which can be a selected portion of the domain corpus, are fed into the 
text processing phase. The text processing phase combines various natural language 
processing techniques to extract three-part structures known as ternary frames. The modules 
involved during text processing are sentence parsing, noun phrase chunking and frame 
extraction. The sentence parsing step unveils part-of-speech tags and dependency structures to enable 
the identification and grouping of noun phrases during noun phrase chunking. An example of 
a dependency structure, in the form of a parse tree, is shown in Figure 4. Existing noun phrase 
chunking techniques typically employ dependency structure analysis and simple word 
association measures based on static text corpora to identify stable sequences of noun phrases. Such 
techniques are unable to handle head nouns with post modifiers such as prepositional phrases 
and conjunctions. For the accurate chunking of noun phrases, we incorporated two measures 
known as Unithood (UH) [32] and Odds of Unithood (OU) [36] for determining the 
collocation strength of word sequences. For example, the two measures help to decide that the phrase 
"Hazard and Operability Study" should remain as an individual stable unit, while the words 
"hazard" and "risk" in "hazard and risk" should not be combined. The OU measure is the 
probabilistic reformulation of the ad-hoc combination of unithood evidence by UH. The frame 
extraction step then extracts ternary frames in the form of < arg1, connector, arg2 >. A 
general set of linguistic rules is used to determine the presence of such frames in the parsed 
texts. These rules are summarised in Figure 3. Using these rules and the example dependency 
structure in Figure 4, we can extract several ternary frames as shown in Figure 5. 

Figure 2. The intermediate output produced by the system at each phase. The text processing 
phase produces ternary frames in the form of < arg1, connector, arg2 > from natural 
language texts. The term recognition phase then identifies domain-relevant terms using the frames. 
Finally, the relation discovery phase identifies associations between the domain terms for 
constructing a lightweight domain ontology. 

Thirdly, noun phrases appearing as arguments in the ternary frames are gathered to form a 
list of term candidates for further processing. The set of term candidates {"team", "several 
hazardous chemical", "new process", "Process Hazards Analysis"} can be obtained from the 
example in Figure 5. The purpose of the term recognition phase is to identify terms from the 
list of candidates which are relevant to and representative of the domain of interest. The 
subjective nature of term relatedness makes termhood determination a challenging issue to address. 
Several measures for determining termhood have been developed in the past with limited 
accuracy. In this regard, we have developed two measures, namely, Termhood (TH) [[STlESl and 
Odds of Termhood (OT) [30] for determining the degree of relatedness of terms to a specific 
domain. These two measures address various issues which were often neglected, such as the 
importance of modifiers in determining the semantics of a term, the difference between the notions 
of prevalence and tendency, and the role of contextual information in determining termhood. 
These two measures make use of the distributional behaviour of term candidates within the 
target domain corpus and also across other domains (i.e. contrastive corpus) as statistical evidence 
to quantify various important kinds of linguistic evidence. Our two termhood measures have been shown 
to perform with high accuracy in comparison to two other existing measures [35]. 

Figure 3. A general set of rules for identifying ternary frames in the form of 
< arg1, connector, arg2 > from parsed texts. 

Figure 4. The dependency structure, in the form of a parse tree, for the sentence "The team 
identified several hazardous chemicals in the new process through Process Hazards Analysis 
(PHA).". 

Figure 5. The four ternary frames extracted from the dependency structure shown in Figure 4 
based on the rules in Figure 3. 

The final phase in our ontology construction system is relation discovery. In this phase, we 
discover unnamed relations between terms through term clustering using an algorithm known as 
Tree-Traversing Ant (TTA) [34,37]. TTA is a hybrid technique inspired by ant-based methods 
[T3l and conventional hierarchical clustering. TTA is capable of further distinguishing hidden 
structures within clusters and is tolerant to differing cluster sizes. The ability to identify and 
isolate outliers, and to produce consistent results makes TTA a reliable term clustering 
technique. Unlike conventional systems which require the computationally intensive task of feature 
extraction and selection, our system employs two featureless measures, namely, n° of Wikipedia 
(noW) [28] and Normalised Google Distance (NGD) [6] for similarity and distance 
measurement. Using the relations discovered in this phase, we can then organise the flat list of domain 
terms discovered from term recognition into a graph structure to produce a lightweight domain 
ontology. This graph structure can be converted into various formats using languages such as 
OWL. This ontology can be edited and maintained using ontology editors such as Protege [21]. 
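To make the notion of a featureless distance concrete, the following is a minimal sketch of the 
standard definition of the Normalised Google Distance, computed directly from page counts rather 
than from extracted features. The page counts and the index size N used below are hypothetical 
values chosen for illustration; how the system actually obtains such counts is not shown here. 

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalised Google Distance computed from page counts.

    fx, fy : number of pages containing term x, term y
    fxy    : number of pages containing both terms
    n      : (estimated) total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never co-occur, are treated as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical page counts, for illustration only.
N = 1e10
related = ngd(fx=2_000_000, fy=1_500_000, fxy=400_000, n=N)
unrelated = ngd(fx=2_000_000, fy=1_500_000, fxy=500, n=N)
print(f"related pair:   {related:.3f}")
print(f"unrelated pair: {unrelated:.3f}")
```

A smaller value indicates that the two terms tend to appear on the same pages, which is the kind 
of evidence a clustering algorithm can use without any feature extraction or selection step. 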

4. Evaluations and Discussions 

For the evaluation, we employ a dataset containing a domain corpus describing "risk 
management" and a contrastive corpus. We constructed a set of about 7,600 documents extracted 
from ScienceDirect using 34 keywords provided by experts. These documents, together with 
the electronic version of a risk management textbook, make up the domain corpus. The 
contrastive corpus comprises about 28,000 news articles crawled from Reuters, CNet and 
Discovery, and five off-the-shelf corpora, namely GENIA [18], BioCreative [15], 
BioMedCentral [[H, Reuters-21578 [23] and British National Corpus [[H. Table 1 summarises the dataset 
used in our evaluation. We use the risk management textbook, which is part of the domain 
corpus, as input to our ontology construction system due to the authority and reliability of its 
content. We feed the textbook into our text processing phase to obtain ternary frames in the form 
of < arg1, connector, arg2 >. Over 22,000 ternary frames were extracted from the textbook. 
For practical reasons, we randomly selected 4,000 frames for further processing. A list of term 
candidates is then constructed from the 4,000 ternary frames by selecting distinct noun phrases 
from arg1 and arg2. The resulting set TC contains 2,841 term candidates. In the following 
two subsections, we will briefly discuss the process of term recognition and term clustering, and 
the evaluation results associated with each phase. 
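The selection of frames and the construction of the candidate set described above can be sketched 
as follows. This is a minimal illustration, not the system's implementation: the names 
`TernaryFrame`, `sample_frames` and `term_candidates` are hypothetical, and the three example 
frames are invented to be consistent with the example sentence in Figure 4 (the four frames 
actually shown in Figure 5 are not reproduced here). 

```python
import random
from typing import Iterable, List, NamedTuple


class TernaryFrame(NamedTuple):
    """A three-part structure < arg1, connector, arg2 >."""
    arg1: str
    connector: str
    arg2: str


def sample_frames(frames: List[TernaryFrame], k: int, seed: int = 0) -> List[TernaryFrame]:
    """Randomly select at most k frames (seeded so the selection is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(frames, min(k, len(frames)))


def term_candidates(frames: Iterable[TernaryFrame]) -> List[str]:
    """Collect the distinct noun phrases found in arg1 and arg2, in first-seen order."""
    seen = {}
    for frame in frames:
        for phrase in (frame.arg1, frame.arg2):
            seen.setdefault(phrase, None)
    return list(seen)


# Hypothetical frames, for illustration only.
frames = [
    TernaryFrame("team", "identified", "several hazardous chemical"),
    TernaryFrame("several hazardous chemical", "in", "new process"),
    TernaryFrame("team", "identified through", "Process Hazards Analysis"),
]
candidates = term_candidates(frames)
print(candidates)
```

In the evaluation, the same two steps would be applied with k = 4,000 to the frames extracted 
from the textbook, giving the set TC. 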

4.1. Performance of Term Recognition 

In the second phase of ontology construction, we perform term recognition using four 
termhood measures, namely, OT, TH [31], CW [[1 and NCV [[ID on the set of 2,841 
term candidates. In this phase, the term candidates are systematically assessed and assigned 
weights to reflect their relevance to the domain represented by the domain corpus. The term 
candidates are then ranked according to the weights assigned to them. We evaluated the 
performance of term recognition from two perspectives. We conducted a qualitative evaluation by 
analysing the frequency distribution of the terms to determine if terms are properly weighted 
according to their distributional behaviour across different domains. The second evaluation 
examines the performance of term recognition quantitatively with the help of domain experts. 

Table 1. The two types of corpora used in our evaluation, and their content. 

In the first part of the evaluation, we analyse the frequency distributions of the ranked term 
candidates generated by the four measures. Generally, terms which occur more frequently in the 
domain corpus than in the contrastive corpus should be assigned higher weights. Other factors 
such as the domain relatedness of heads and context should also be taken into consideration. 
Figure 6 shows the frequency distributions of the term candidates ranked in descending order 
according to the weights assigned by the respective measures. In this evaluation, a measure is 
considered capable of identifying highly related terms if the corresponding graph shows a high 
degree of polarisation of the oscillating lines. The dark oscillating lines represent the domain 
frequencies $f_d$ while the grey oscillating lines are the contrastive frequencies $f_{\bar{d}}$. 
Terms which are assigned higher weights are considered more relevant and are located towards the 
left of the x-axis. Ideally, terms at the start of the x-axis should have very high domain 
frequency $f_d$ (i.e. located higher on the y-axis) and relatively lower contrastive frequency 
$f_{\bar{d}}$ (i.e. lower on the y-axis). One can notice interesting trends in the graphs by CW 
and NCV in Figure 6. The first half of the graph by CW, prior to the sudden surge of frequency, 
consists of only complex terms (i.e. multi-word terms). The relatively lower occurrence counts of 
complex terms as compared to simple terms explain this disparity in the frequency distribution 
produced by CW. This is attributed to the biased treatment given to complex terms evident in the 
formulation of the CW measure. Priority is also given to complex terms by TH but, as one can 
see from the distribution of candidates by TH, such an undesirable trend does not occur. One 
explanation is the heavy reliance on frequency by CW, while TH attempts to diversify the 
evidence used in the computation of weights. While frequency may be a reliable source of evidence, 
the use of it alone is definitely inadequate [i5J. As for NCV, Figure 6 reveals that scores are 
assigned to candidates by NCV based solely on the domain frequency. In other words, the 
measure NCV lacks the required contrastive analysis. As we have pointed out, terms can 
be ambiguous and we must not ignore the cross-domain distributional behaviour of terms. In 
addition, upon inspecting the actual list of ranked candidates, we noticed that higher scores are 
assigned to candidates which are accompanied by more context words. Another positive trait 
that TH exhibits is its ability to assign higher scores to terms that occur relatively more frequently 
in the domain corpus than in the contrastive corpus. This is evident from the gap between 
$f_d$ (dark oscillating line) and $f_{\bar{d}}$ (light oscillating line), especially at the 
beginning of the x-axis. One can notice that candidates towards the end of the x-axis are those 
with $f_{\bar{d}} > f_d$. The same can be said about our new measure OT. However, the 
discriminating power of OT is apparently better since the gap between $f_d$ and $f_{\bar{d}}$ is 
larger and persists for longer. 

Figure 6. Distributions of the 2,841 terms extracted from the textbook, which is part of the 
domain corpus, sorted according to the corresponding scores provided by the four measures. 
The single dark smooth line stretching from the left (highest value) to the right (lowest value) 
of the graph is the scores assigned by the respective measures. As for the two oscillating lines, 
the dark line is the domain frequencies $f_d$ while the light one is the contrastive frequencies 
$f_{\bar{d}}$. 
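The visual check described above can also be approximated numerically. The sketch below is an 
illustrative proxy and is not part of the evaluation reported here: for the top k candidates 
ranked by some measure, it reports the share whose domain frequency exceeds their contrastive 
frequency. The `Candidate` type, the function `domain_dominance`, and all scores and frequencies 
are hypothetical. 

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    term: str
    score: float          # weight assigned by some termhood measure
    f_domain: int         # frequency in the domain corpus
    f_contrastive: int    # frequency in the contrastive corpus


def domain_dominance(candidates: List[Candidate], k: int) -> float:
    """Share of the top-k ranked candidates with f_domain > f_contrastive."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:k]
    if not ranked:
        return 0.0
    return sum(c.f_domain > c.f_contrastive for c in ranked) / len(ranked)


# Hypothetical scores and frequencies, for illustration only.
ranked_by_some_measure = [
    Candidate("hazard identification", 0.9, 120, 3),
    Candidate("risk assessment", 0.8, 95, 10),
    Candidate("process", 0.4, 300, 900),
    Candidate("year", 0.1, 50, 4000),
]
top2 = domain_dominance(ranked_by_some_measure, k=2)
top4 = domain_dominance(ranked_by_some_measure, k=4)
print(top2, top4)
```

A measure with good contrastive behaviour would keep this share high near the top of the ranking 
and let it fall only towards the tail, mirroring the gap between the two oscillating lines. 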

Table 2. An example of a contingency table. Note that |TC| = TP + FP + FN + TN 
where |TC| is the total number of term candidates in the input set. 

In addition to qualitative assessment, we conducted a quantitative evaluation on the set of 
terms produced during term recognition with the help of domain experts. There are several 
performance measures common to the fields of Information Retrieval and Information Extraction 
such as precision, recall, F-measure, and accuracy. These measures are computed by 
constructing a contingency table as shown in Table 2: 

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

where TP, TN, FP and FN are values from the four cells of the contingency table shown in 
Table 2. Considering the manpower constraint in manual assessment, we limit our evaluation to 
only the top n = 300 ranked term candidates from each list produced by each measure. In this 
case, the "retrieved" portion of the "actual results" in the contingency table refers to the top 
300 terms from each measure. The "not retrieved" portion in the table cannot be defined since 
there is no standard set of terms available through benchmarks or gold standards for our domain 
of interest. Due to the absence of the two pieces of information TN and FN, the precision 
measure is the only applicable performance indicator of term recognition in this paper. The 
precision at n is the fraction of the top n term candidates that are considered relevant to the 
domain. We sought the help of experts in the domain of "risk management" to decide if the top 
300 terms by each measure are actually relevant to the domain. The domain experts performed 
a binary classification by deciding if a term is relevant or not relevant to our domain of interest. 
We organised the results in Table 3. The TP row shows the number of terms ranked within 
the top 300 which are actually relevant to the domain of "risk management". FP, on the other 
hand, contains the number of non-relevant terms. As shown in Table 3, term recognition using 
the two measures presented as part of our system (i.e. OT and TH) offers far more precise 
results compared to existing measures. Out of the top 300 terms ranked by OT and TH, about 
59% to 67% are considered relevant to the domain of "risk management", as compared 
to only 7% to 26% precision by the other measures. The actual list of term candidates and the experts' 
evaluation are available on our research site. Although these precision figures are modest in 
absolute terms, the performance of OT and TH is encouraging in view of the 
following reasons: 

• Issues such as coverage and sparsity of the domain corpus have tremendous impact on 
the performance of term recognition. The domain corpus used in this evaluation was 
constructed automatically from various sources without any attempt to assess its adequacy 
in these aspects. 

• The term candidates in this evaluation were automatically extracted from real-world texts 
without human intervention. The text processing phase and specifically, the extraction 
of term candidates have errors of their own (e.g. incorrect noun phrase chunking). Such 
errors will inevitably propagate to the next phase of term recognition. 

• In manual assessment, evaluation results vary depending on the human experts. Our 
domain experts imposed stringent requirements during their inspections of the ranked 
terms. The terms are only classified as relevant if the experts are absolutely certain of the 
terms' significance to the domain of "risk management". 



Table 3. The performance of term recognition for all four measures in this evaluation. 
The TP and FP rows contain the number of relevant terms and the number of terms 
considered as not relevant by the domain experts, respectively. 

                 OT       TH       CW       NCV
TP               178      202      79       22
FP               122      98       221      278
n                300      300      300      300
Precision (%)    59.33    67.33    26.33    7.33
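The precision values in Table 3 follow directly from the TP and FP counts. The sketch below 
implements the four measures defined from the contingency table and recomputes precision at 
n = 300 for each termhood measure. The function names are illustrative; recall, F1 and accuracy 
are included only for completeness, since TN and FN are unavailable in this evaluation. 

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)


# (TP, FP) counts among the top n = 300 candidates, taken from Table 3.
table3 = {"OT": (178, 122), "TH": (202, 98), "CW": (79, 221), "NCV": (22, 278)}

precision_at_300 = {
    measure: round(100 * precision(tp, fp), 2) for measure, (tp, fp) in table3.items()
}
for measure, value in precision_at_300.items():
    print(f"{measure}: {value:.2f}%")
```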



4.2. Performance of Relation Discovery 

In the second part of the evaluation, we performed relation discovery to identify unnamed 
associations between the domain terms produced during term recognition. We employed two 
metrics in ontology learning for evaluation. The first is known as Lexical Overlap (LO) for 
evaluating the intersection between the set of discovered concepts ($C_d$) and the benchmark 
concepts ($C_m$) [19]. LO is defined as: 

$$LO = \frac{|C_d \cap C_m|}{|C_m|} \qquad (1)$$

where |C| is the number of concepts in set C. The second metric, known as Ontological Loss 
(OL), is used to identify the number of benchmark concepts that were not discovered during 
term clustering. OL is defined as [25]: 

$$OL = \frac{|C_m \setminus C_d|}{|C_m|} \qquad (2)$$
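Both metrics reduce to simple set operations. The following sketch computes them for a benchmark 
of 16 concepts of which 11 are discovered, matching the counts reported for this experiment. The 
concept labels are placeholders rather than the actual benchmark concepts, and the extra 
discovered concept is included only to show that concepts outside the benchmark do not affect 
either metric. 

```python
from typing import Set


def lexical_overlap(discovered: Set[str], benchmark: Set[str]) -> float:
    """LO: fraction of benchmark concepts that were discovered."""
    return len(discovered & benchmark) / len(benchmark)


def ontological_loss(discovered: Set[str], benchmark: Set[str]) -> float:
    """OL: fraction of benchmark concepts that were not discovered."""
    return len(benchmark - discovered) / len(benchmark)


# Placeholder labels: 16 benchmark concepts, 11 of them discovered.
benchmark = {f"concept {i}" for i in range(1, 17)}
discovered = {f"concept {i}" for i in range(1, 12)} | {"concept outside the benchmark"}

lo = lexical_overlap(discovered, benchmark)
ol = ontological_loss(discovered, benchmark)
print(f"LO = {lo:.2%}, OL = {ol:.2%}")
```

By construction LO + OL = 1, so the two metrics describe the same outcome from opposite sides. 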
Figure 7. The 21 upper-level concepts in the hand-crafted domain ontology for hazard 
identification methods in risk management [12]. These concepts are used as benchmark concepts for 
evaluation. 



To ensure that a high-quality domain ontology is produced, the top n = 30 terms ranked by 
TH, which have 100% precision, were automatically selected for term clustering. We employ 
a manually-crafted domain ontology for hazard identification methods in risk management 
[12] as a benchmark for comparison. The benchmark ontology has a set of 21 concepts as 
shown in Figure 7. However, we selected only 16 concepts from the benchmark ontology 
for comparison or, in other words, $|C_m| = 16$. The 5 concepts excluded were "concept safety 
review", "critical examination of safety system", "preliminary process hazard analysis", "pilot 
plants" and "step-by-step method". The terms which represent these concepts, and other related 
terms which can be generalised to form these concepts, were not present in the textbook. Such 
constraints on the part of the textbook can result in the misinterpretation of the actual capability 
of the system proposed in this paper. 

The lightweight ontology created using our term clustering technique is shown in Figure 
[81 Our clustering technique managed to discover 11 out of the 16 benchmark concepts. For 
this experiment alone, our system achieved a lexical overlap of 11/16 = 69%. The system 
experiences a 5/16 = 31% ontological loss due to the non-discovery of 5 benchmark concepts. 
The non-discovered concepts were "hazard indices ", "preliminary hazard analysis ", "safety 
audits", "sneak analysis" and "task analysis". Upon detail examination, we identified the 
causes behind the system's inability to discover these 5 concepts: 

• "preliminary hazard analysis" was mentioned only once as plain text (i.e. not part of a 
figure or table) in the textbook, and was not part of the randomly selected 4, 000 frames 
for term recognition. 

• "safety audits" does not occur in the textbook. Possible related terms such as "process 
safety audit" and "operational safety audit" that could assist in discovering a generalised 
concept have only one to two occurrences in the textbook, and were excluded from the 
4, 000 frames. 

Figure 8. The clustering result produced by our system. The flat list of terms produced by
term recognition is organised into a graph structure to form a lightweight domain ontology.
The main concepts discovered by the system are labelled as C1 to C11. These discovered
concepts correspond to 11 out of the 21 benchmark concepts. Concepts C1 to C9 are exact
matches of the benchmark concepts, while C10 and C11 are concepts discovered through
generalising related terms. For instance, concept C10 is an abstract representation of "human
reliability analysis" obtained through generalising related techniques, namely, "technique for
human error rate prediction" (THERP) and "human error assessment and reduction technique"
(HEART). The concept representing "checklists", C11, is produced after two levels of
generalisation. Terms related to the individual elements of "checklists" were first clustered to
form the corresponding elements, which in turn were used to discover the main "checklists"
concept. There were 11 "checklists" sub-concepts (C11.1 to C11.11) generated. Sub-concepts
C11.1 to C11.11 represent "chemical reactors", "fire protection", "human factors &
human errors", "incident investigation", "management system", "personal safety",
"physico-chemical property", "plant start-up & shutdown", "pressure system design", "storage"
and "transport", respectively.

• "sneak analysis" and "task analysis" have low occurrences in the textbook, and were not
included for term recognition.

• "hazard indices" was not extracted as part of any frame due to its absence from the
textbook. Possible related terms such as "chemical exposure index" and "instantaneous
fractional annual loss" have fewer than ten occurrences in the book and were excluded
from the 4,000 frames. Other terms such as "runaway reaction hazard index" and "mortality
index", which could help in discovering a generalised concept, do not appear in the
textbook. Another useful term, "fire and explosion index", which was mentioned in the
book, was not included for term recognition as a complete term due to an error with noun
phrase chunking during text processing. The term was extracted as two separate parts,
"fire" and "explosion index", in the 4,000 frames.

In short, the three general reasons behind the system's inability to discover the desired concepts
are 1) inadequate statistical evidence due to low occurrence in text corpora, 2) incidental
exclusion during input at each phase, and 3) errors introduced at each phase. Only the third
reason is the result of faults in our system. The first reason is due to the inadequacy of text
corpora and is not related to the system. The second reason is caused by user-imposed constraints
during input, where useful information is unintentionally filtered out. For example, in this
experiment, we restricted the input of term recognition to only 4,000 ternary frames. In addition,
only the top 36 ranked terms were further processed to construct the lightweight domain ontology.
Referring back to our specific causes above, 4 out of the 5 undiscovered concepts in this
experiment were the results of the first and second reasons, which are unrelated to our system.
If these 4 concepts are left out of the benchmark (i.e. |Cm| - 4 = 12), our system is in fact
capable of performing at a lexical overlap of 11/(|Cm| - 4) = 11/12 = 92%. In other words, there
is only an 8% ontological loss that is actually caused by faults in our system.
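The arithmetic behind the two lexical overlap figures can be retraced with a short sketch. The function names, the placeholder labels "C1" to "C11" for the discovered concepts, and the choice of "hazard indices" as the one concept attributed to system faults (suggested by the chunking error described above) are assumptions made for illustration only; they are not part of the system itself.

```python
def lexical_overlap(discovered, benchmark, excluded=frozenset()):
    """Fraction of the considered benchmark concepts that were discovered.

    `excluded` holds benchmark concepts removed from the denominator,
    e.g. concepts missed for reasons unrelated to the system.
    """
    considered = set(benchmark) - set(excluded)
    if not considered:
        raise ValueError("no benchmark concepts left to evaluate")
    return len(considered & set(discovered)) / len(considered)


def ontological_loss(discovered, benchmark, excluded=frozenset()):
    """Complement of the lexical overlap."""
    return 1.0 - lexical_overlap(discovered, benchmark, excluded)


# Placeholder labels for the 11 discovered benchmark concepts (assumption).
discovered = {f"C{i}" for i in range(1, 12)}

# The 5 benchmark concepts reported as not discovered.
missed = {
    "hazard indices",
    "preliminary hazard analysis",
    "safety audits",
    "sneak analysis",
    "task analysis",
}
benchmark = discovered | missed  # 16 benchmark concepts in total

# Assumption: "hazard indices" is the concept lost to a system fault,
# so the other 4 are excluded from the adjusted benchmark.
unrelated_to_system = missed - {"hazard indices"}

raw = lexical_overlap(discovered, benchmark)
adjusted = lexical_overlap(discovered, benchmark, unrelated_to_system)

print(f"raw overlap:      {raw:.0%}")       # 11/16
print(f"adjusted overlap: {adjusted:.0%}")  # 11/12
```

Keeping the excluded concepts as an explicit argument makes it visible which benchmark concepts were discounted and why, so the adjusted figure can be recomputed under a different attribution of causes.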

5. Conclusions and Future Work 

Domain ontologies in risk management for chemical processes are becoming increasingly
important to support information sharing and reuse among chemical engineers and information
systems. In this paper, we presented and evaluated an automatic lightweight ontology
construction system based on dedicated text mining techniques. The system is composed of four
main phases, namely, text cleaning, text processing, term recognition and relation discovery. The
techniques in each phase were designed to employ only dynamic resources such as Wikipedia
and Google to ensure applicability to different domains. Our evaluations using real-world texts
have shown that the proposed system is capable of automatically constructing high quality
lightweight domain ontologies. The performance of lightweight ontology construction using
our system can be improved further by 1) diversifying and verifying the sources of term
candidates, and 2) increasing the size and improving the quality of text corpora used for term
recognition.

The constructed domain ontology is a valuable asset for various applications in risk
management. One such application is ontology-based document indexing and retrieval. We are
planning to employ the domain ontology to perform document indexing on the domain texts.
The texts can be indexed using the concepts in the ontology, and related documents can then be
easily determined using the relations in the ontology. Such a facility offers a conceptual view of
the documents in a collection for improving document searchability.
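
The planned indexing and retrieval facility might look roughly like the following sketch. Everything in it is hypothetical: the toy relations, the document texts, the function names and the naive substring matching are assumptions chosen only to illustrate indexing by concept and retrieval through relations, not a description of an implemented component.

```python
from collections import defaultdict

# Hypothetical fragment of an ontology: concept -> directly related concepts.
RELATIONS = {
    "checklists": {"fire protection", "storage"},
    "fire protection": {"checklists"},
    "storage": {"checklists"},
    "human reliability analysis": set(),
}


def build_index(documents, concepts):
    """Map each concept to the ids of documents that mention it.

    Matching is a naive case-insensitive substring test (assumption);
    a real system would match recognised terms instead.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for concept in concepts:
            if concept in lowered:
                index[concept].add(doc_id)
    return index


def retrieve(concept, index, relations):
    """Return (documents about the concept, documents about related concepts)."""
    direct = set(index.get(concept, set()))
    related = set()
    for neighbour in relations.get(concept, set()):
        related |= index.get(neighbour, set())
    return direct, related - direct


# Hypothetical documents for illustration.
documents = {
    "d1": "Checklists for plant start-up.",
    "d2": "Fire protection measures for solvent storage.",
    "d3": "Human reliability analysis using THERP.",
}

index = build_index(documents, RELATIONS)
direct, related = retrieve("checklists", index, RELATIONS)
print(sorted(direct), sorted(related))
```

Here a query for "checklists" returns the document that mentions the concept directly, plus a document that mentions only its related concepts, which is the conceptual view of a collection that the relations in the ontology are meant to provide.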